Abstract

This paper presents the design and implementation of a
compiler and runtime infrastructure for automatic program
distribution. We are building a research infrastructure that
enables experimentation with various program partitioning
and mapping strategies and the study of automatic
distribution's effect on resource consumption (e.g., CPU,
memory, communication). Since many optimization techniques
are faced with conflicting optimization targets (e.g.,
memory and communication), we believe that it is important
to be able to study their interaction.

We present a set of techniques that enable flexible resource
modeling and program distribution. These are: dependence
analysis, weighted graph partitioning, code and
communication generation, and profiling. We have developed
these ideas in the context of the Java language. We present
in detail the design and implementation of each of the
techniques as part of our compiler and runtime
infrastructure. Then, we evaluate our design and present
preliminary experimental data for each component, as well as
for the entire system.


1.  Introduction 

There are important potential benefits of automatic over
manual program distribution, such as correctness, increased
productivity, adaptive execution, and concurrency
exploitation. This paper describes a new approach to
automatic program distribution. In contrast with previous
work, instead of considering a particular class of programs
and optimization targets, we consider general-purpose
programs and study multiple optimization targets. Our system
accepts a monolithic program and transforms it into multiple
communicating parts in networked systems.


* Parts of this research were funded under ONR award
N00014-01-1-0854 and NSF award CNS-0205712.


1.1.  Possible  Uses 

Our approach places high emphasis on the generality of the
distribution strategy and the ability to build an abstract
model of the execution environment. Then, the distribution
strategy can be specialized to concrete environments. We
recognize that this approach may not be suitable for all
computations. Many programs may not need distribution at
all.

In some cases, however, automatic distribution is crucial.
New technologies such as pervasive computing require that
applications connect from any device, over any network,
using any style of interface. Mobile computing requires that
mobile code be deployed over heterogeneous networks of
sometimes resource-constrained devices. If there are not
enough resources available to accommodate a given program on
a single computing node, the promises of these technologies
cannot be delivered. In this context, automatic distribution
can help with increased accessibility, resource sharing, and
load balancing.

Another broad class of data-intensive applications relies on
networked systems to process their data concurrently. Such
applications range from inherently concurrent applications
like image processing, universe exploration, and
computer-supported cooperative work, to loosely concurrent
applications such as fluid mechanics in avionics and marine
structures. In this context, automatic distribution can help
with exploiting concurrency, reducing the execution time,
and increasing scalability.

Our  specific  technical  contributions  relative  to  previous 
systems  with  similar  goals  are: 

•  A set of techniques for a novel approach to automatic
program distribution. These techniques are: object
dependence graph construction, general graph partitioning,
automatic communication generation, and automatically
distributed program execution.

•  An original compiler and runtime infrastructure that
implements all the above techniques to allow flexible
program distribution based on program access pattern,
resource requirements, and resource availability.

Figure 1. The distributed compiler and runtime
infrastructure.


1.2.  Basic  Approach 

Our compiler and runtime infrastructure is depicted in
Figure 1. This system transforms sequential Java programs
into distributed programs. Moreover, the system attempts to
model the resources needed by the sequential program and
distribute the program based on the resource availability in
a networked system. To this effect, the system performs the
following transformations:

1. The front-end transforms Java bytecode into the
intermediate representation using the Joeq front-end [22].
Joeq provides us with two intermediate representations:
bytecode and quad. The latter is a quadruple-style IR which
resembles register-based representations.


2. The system uses our static analysis framework to
approximate the object dependence graph for a program and
model its resource requirements.

3. The system partitions the object dependence graph using a
Java wrapper of the Metis graph partitioning tool [14].

4. The system uses bytecode rewriting to insert
communication calls for remote dependences in the
partitioned program. Also, the system uses a bottom-up
rewrite system to generate target code for the various
platforms making up the networked configuration. For better
resource utilization, in the future we plan to use native
execution rather than Java Virtual Machine (JVM) hosted
execution on (possibly resource-constrained) devices.

5. The system monitors the program execution and collects a
set of statistics about resource usage. We use this
information to gain insight into static partitioning. In the
future we plan to use this information to perform adaptive
repartitioning.

The rest of the paper is organized as follows. Section 2
describes the object dependence graph construction.
Section 3 explains how we use graph partitioning to model
resources and split the program into multiple pieces.
Section 4 describes in detail our approach to code and
communication generation. Section 5 presents the
implementation of a runtime system that allows automatic
distributed execution. Section 6 presents the design and
implementation of a mixed instrumentation and sampling
profiler that monitors programs during execution. Section 7
discusses an initial evaluation of the techniques we
introduce in this paper. Section 8 reviews related research
and contrasts our effort with previous approaches that have
similar goals. Section 9 concludes the paper and underlines
future research directions.

2.  Dependence  Graph  Construction

The first transformation our system performs is to create
the dependence graph of the program. This graph depicts the
dependences between program objects and serves as the input
for the resource modeling and graph partitioning phase. We
use a concrete example to illustrate the dependence graph
construction.

2.1.  An  Example

Figure 2 shows an example of a Java program that we use
throughout the paper. In our example, there are two classes.
The Account class describes a bank account with a unique
identifier, holder name, checking, savings, and loan. The
Bank class describes a banking institution with a unique
identifier, name, number of customers, and a list
(java.util.Vector) of their actual accounts.


public class Account {
    // ...
}

public class Bank {
    // ...
    protected Bank(String name, int numCustomers, int initialBalance) {
        initializeAccounts(initialBalance);
    }

    private void initializeAccounts(int initialBalance) {
        while (numCustomers > 0) {
            // ...
            Account a = new Account(i, n, s, c);
            accounts.add(a);
            numCustomers--;
        }
    }

    public void openAccount(Account a) {
        accounts.add(a);
    }

    public boolean withdraw(int customerID, int amount) {
        if (...) {
            this.getCustomer(customerID).setBalance(
                this.getCustomer(customerID).getBalance() - amount);
            return true;
        } else return false;
    }

    public static void main(String[] args) {
        Bank merchants = new Bank("Merchants", 100, 10000);

        Account a4 = new Account(1, "ABC Market", 1000000, 100000, 20000000);
        Account a5 = new Account(2, "CDE Outlet", 5000000, 300000, 150000000);
        merchants.openAccount(a4);
        merchants.openAccount(a5);

        Account a = merchants.getCustomer(2);
        merchants.withdraw(a.getId(), 900);
    }
}


Figure  2.  An  example  of  a  Java  program. 

The Bank class initializes a number of Account structures
for its clients. On an openAccount event, an Account
reference is passed to the Bank object and added to the
existing accounts list. The Bank.withdraw(...) method
reduces the balance by the amount withdrawn. The main method
creates instances of a bank and of various types of
accounts, which are then opened and operated on. Our
analysis is targeted toward finding these instances and
their dependences.

We have implemented an improved version of Spiegel's
algorithm [20] (for a detailed contrast see [2]). We use
rapid type analysis (RTA) to compute the call graph and the
program types. Then, for each method in the graph we compute
the class relations by looking at field access and method
call statements. A usage relation between two classes occurs
when one class calls methods or accesses fields of another
class. Export or import relations occur when new types may
propagate from one class to another through field accesses
or method calls.
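As a rough illustration of how the three relation kinds could be read off method-call statements, consider the Java sketch below. It is not our analysis framework: the CallSite class, the string encoding of relations, and the direction chosen for import edges are assumptions made only for this example, and field accesses are omitted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical summary of one method-call statement (illustrative names). */
final class CallSite {
    final String callerClass;    // class containing the call statement
    final String calleeClass;    // class declaring the invoked method
    final List<String> argTypes; // reference types passed as arguments
    final String returnType;     // reference type returned, or null

    CallSite(String callerClass, String calleeClass,
             List<String> argTypes, String returnType) {
        this.callerClass = callerClass;
        this.calleeClass = calleeClass;
        this.argTypes = argTypes;
        this.returnType = returnType;
    }
}

final class ClassRelationSketch {
    /** Derives relation edges, encoded as "kind:From->To[:Type]". */
    static Set<String> derive(List<CallSite> sites) {
        Set<String> relations = new LinkedHashSet<>();
        for (CallSite s : sites) {
            // Calling a method of another class is a usage relation.
            relations.add("use:" + s.callerClass + "->" + s.calleeClass);
            // Argument types may propagate from caller to callee (export).
            for (String t : s.argTypes) {
                relations.add("export:" + s.callerClass + "->"
                        + s.calleeClass + ":" + t);
            }
            // A returned type may propagate from callee to caller (import).
            if (s.returnType != null) {
                relations.add("import:" + s.calleeClass + "->"
                        + s.callerClass + ":" + s.returnType);
            }
        }
        return relations;
    }

    public static void main(String[] args) {
        List<CallSite> sites = Arrays.asList(
            // merchants.openAccount(a4) in Bank.main
            new CallSite("ST_Bank", "DT_Bank",
                    Arrays.asList("DT_Account"), null),
            // merchants.getCustomer(2) in Bank.main
            new CallSite("ST_Bank", "DT_Bank",
                    Collections.<String>emptyList(), "DT_Account"));
        for (String r : derive(sites)) {
            System.out.println(r);
        }
    }
}
```

The two call sites in main correspond to the openAccount and getCustomer invocations of the running example; the ST_ and DT_ names follow the static/dynamic prefix convention used for the class relation graph.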

Figure 3 shows the class relation graph for our example. We
use the aiSee¹ tool for the visualization of the graph in
the Visualising Compiler Graphs (VCG) format. The types are
annotated with the ST_ or DT_ prefix to indicate static or
instance (dynamic) parts of a class. The use relations
indicate that some classes occur in the context of other
classes; their occurrence is detected by looking at the
method calls, field accesses, and allocation statements. The
export edge occurs due to the invocation of the openAccount
method on the dynamic Bank class with an Account class as
parameter. The import edge occurs due to the getCustomer
invocation that returns a result of Account type.

¹ A graph visualization tool from AbsInt. Available from
http://www.absint.com/aisee.html.


Figure 3. The class relation graph visualized with the aiSee
tool for the VCG format.


Figure 4. The object dependence graph visualized with the
aiSee tool for the VCG format.

Given the class relation graph and the object set, we
compute the relation between the corresponding objects
(class instances). For each allocation statement, we add
reference relations between the instance of the class where
the allocation takes place and the newly created instance.
We then create new references by matching the initial object
references against the export and import relations between
the corresponding classes. We iterate through all object
triples and propagate references matching against the type
relations until the algorithm reaches a fixed point.
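The fixed-point iteration can be pictured with the simplified Java sketch below. It is an assumption-laden illustration rather than our algorithm (see [2] for that): the two propagation rules, the "From>To:Type" key encoding, and all identifiers are hypothetical, and every object named in a reference set is assumed to appear in classOf.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ReferencePropagation {
    /** Hypothetical encoding of a class-level export or import relation. */
    static String key(String from, String to, String type) {
        return from + ">" + to + ":" + type;
    }

    /**
     * classOf maps each object to its class; initialRefs holds the
     * references created by allocation statements. Returns the
     * reference sets after propagation reaches a fixed point.
     */
    static Map<String, Set<String>> propagate(
            Map<String, String> classOf,
            Map<String, Set<String>> initialRefs,
            Set<String> exports,
            Set<String> imports) {
        Map<String, Set<String>> refs = new TreeMap<>();
        for (String o : classOf.keySet()) {
            Set<String> init = initialRefs.getOrDefault(
                    o, Collections.<String>emptySet());
            refs.put(o, new TreeSet<>(init));
        }
        boolean changed = true;
        while (changed) {                       // iterate to a fixed point
            changed = false;
            for (String a : classOf.keySet()) {
                for (String b : new ArrayList<>(refs.get(a))) {
                    // Export rule: a references b and c, and a's class may
                    // pass c's type to b's class, so b may reference c.
                    for (String c : new ArrayList<>(refs.get(a))) {
                        if (!c.equals(b) && exports.contains(key(
                                classOf.get(a), classOf.get(b), classOf.get(c)))) {
                            changed |= refs.get(b).add(c);
                        }
                    }
                    // Import rule: a references b, b references c, and a's
                    // class may obtain c's type from b's class.
                    for (String c : new ArrayList<>(refs.get(b))) {
                        if (!c.equals(a) && imports.contains(key(
                                classOf.get(a), classOf.get(b), classOf.get(c)))) {
                            changed |= refs.get(a).add(c);
                        }
                    }
                }
            }
        }
        return refs;
    }
}
```

Because reference sets only grow and the number of objects is finite, the loop terminates. With main referencing both merchants and a4, and an export of Account from the static to the dynamic Bank part, the sketch adds a reference from merchants to a4.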

Figure 4 shows the object graph for our example. The edges
are labeled create, use, or reference. The objects are
prefixed by 1_, indicating single instances (a *_ prefix
indicates summary instances of zero or more objects, i.e.,
objects created inside a control structure). The reference
relation is redundant and only used for intermediate
processing, so we can safely discard it. The create relation
means that an object creates another object. The creation
relation between object pairs is propagated to discover new
usage relations from the class relation graph. Therefore,
after the propagation, only the usage relation matters for
the partitioning: if an object a on abstract processor Pa
uses an object b on abstract processor Pb, then
communication may be generated. We also show the partition
number within square brackets for a two-way partitioning.
For details on the actual algorithm and implementation,
please refer to our technical report [2].

3.  Graph  Partitioning 

The next transformation our system performs is graph
partitioning. This phase assigns a virtual processor number
to each object.

Multi-constraint graph partitioning seeks a partitioning of
the object dependence graph that minimizes the edge cut, and
thus communication, while accounting for the resource
constraints of each partition.

Finding an optimal multi-way partition for large graphs is
an NP-complete problem (thus, no polynomial-time algorithm
for it is known). However, many heuristic-based approaches
exist [3, 12]. To our knowledge the most advanced multilevel
partitioning scheme is Hendrickson et al.'s [7].

We use Metis' multi-objective, multi-constraint graph
partitioning algorithms to partition the dependence graph.
We model the resources for the object dependence graph as
follows. Each object in the graph encapsulates data and
computation. The amount of data it encapsulates
characterizes the memory usage, while the amount of
computation characterizes the CPU usage. The weight of a
node is a vector that contains memory, CPU, and battery
usage for the creation and usage of an object. An edge
between two objects indicates a potential communication, if
the objects were to reside in two different address spaces.
The data that needs to be transferred between address spaces
is the dependence data (i.e., field, method arguments, or
result). The weight of an edge is the amount of data that
needs to be transferred due to a dependence.
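To make the resource model concrete, the Java sketch below evaluates a given partition assignment: it sums the weights of edges that cross partitions (the communication estimate) and accumulates the node weight vectors per partition (the resource load). It does not call Metis; the array-based encoding and all names are assumptions chosen for brevity.

```java
import java.util.Arrays;

final class PartitionCost {
    /**
     * Edge cut: total weight of the edges whose endpoints were
     * assigned to different partitions. edges[e] = {u, v}.
     */
    static long edgeCut(int[][] edges, long[] edgeWeight, int[] part) {
        long cut = 0;
        for (int e = 0; e < edges.length; e++) {
            if (part[edges[e][0]] != part[edges[e][1]]) {
                cut += edgeWeight[e];
            }
        }
        return cut;
    }

    /**
     * Per-partition load: component-wise sum of the node weight
     * vectors (e.g., memory, CPU, battery) of the objects it hosts.
     */
    static long[][] loads(long[][] nodeWeight, int[] part, int k) {
        long[][] load = new long[k][nodeWeight[0].length];
        for (int v = 0; v < nodeWeight.length; v++) {
            for (int r = 0; r < nodeWeight[v].length; r++) {
                load[part[v]][r] += nodeWeight[v][r];
            }
        }
        return load;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {0, 2}};
        long[] edgeWeight = {5, 3, 2};            // data per dependence
        long[][] nodeWeight = {{10, 1, 0}, {20, 2, 0}, {5, 4, 1}};
        int[] part = {0, 0, 1};                   // two virtual processors
        System.out.println(edgeCut(edges, edgeWeight, part));
        System.out.println(Arrays.deepToString(loads(nodeWeight, part, 2)));
    }
}
```

A partitioner trades these two quantities against each other: moving an object to another partition changes both the cut and the load vectors of the two partitions involved.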

We use static approximations of resource consumption to
guide the static partitioning. We currently assume that all
objects have equal weights, so these approximations can be
imprecise. In the future we plan to use simple heuristics;
for example, objects created inside loops can be considered
"heavier" than single-instance objects.


Java:

public class Example {
    int ex(int b) {
        b = 4;          // 1
        if (b > 2) {    // 2
            b++;        // 3
        }
        return b;       // 4
    }
}

Quad:

BB0 (ENTRY) (in: <none>, out: BB2)
BB2 (in: BB0 (ENTRY), out: BB3, BB4)
1   MOVE_I    R1 int, IConst: 4
2   IFCMP_I   IConst: 4, IConst: 2, LE, BB4
BB3 (in: BB2, out: BB4)
3   ADD_I     R1 int, IConst: 4, IConst: 1
BB4 (in: BB2, BB3, out: BB1 (EXIT))
4   RETURN_I  R1 int
BB1 (EXIT) (in: BB4, out: <none>)

Figure 5. Turning a Java class into quads.


In our current implementation we have written a Java wrapper
[10] for the Metis graph partitioning tool [14]. The wrapper
implementation (including visualization capabilities) is
about 10,000 lines of code.

4.  Code  and  Communication  Generation 

Once each object has been assigned to a virtual processor,
the program can be distributed by mapping virtual processors
to actual processing units at runtime. There are two issues
related to the distributed execution. First, native
execution in heterogeneous environments requires
retargetable code generation. Second, correct execution
requires communication to satisfy the remote dependences.

To address retargetable code generation we use the quad
high-level intermediate representation to generate Abstract
Syntax Trees (AST) and then use a bottom-up rewrite system
(BURS) [18] to emit code for a range of architectures
(currently x86 and StrongARM).

To address communication generation, we use the dependence
and partitioning information to classify objects as local
and dependent. Local objects have no dependences on objects
in different address spaces. Thus, they are treated as
normal objects and no communication is generated for them.
Dependent objects have dependences across address spaces and
thus, messages are inserted to resolve these dependences.

4.1.  Retargetable  Code  Generation 

The input for this phase is the quad intermediate
representation. The result is a generated set of compilers
for various target machines. An example of the quad format
is shown in Figure 5, along with the Java class that was
used to generate it.


Figure  6.  A  Tree  representation  of  the  quads. 


Abstract Syntax Tree. Once the quad source is established,
the program is turned into an Abstract Syntax Tree that acts
as the code generator front-end. The AST is structured such
that each instruction acts as a root node, with instruction
parameters represented as child leaves. The tree generator
we use is ANTLR [16], a parser generator similar to Yacc. A
visual representation of this tree is shown in Figure 6.

Because of the inherent simplicity of the quad format, it
would be feasible to write a simple, linear parser from
scratch and build a code generator on top of it. Though that
approach may perform faster and can be more specialized to
this task, using the tree allows extensibility: the code
generator can be used with any intermediate representation
or source language, since creating a tree lets us completely
abstract the source.

Bottom-Up Rewrite Generation. After obtaining the tree
representation of the source, the remaining work is done in
the back-end through a method called Bottom-Up Rewrite
Machine Generation, or BURM. This makes two passes over the
incoming AST: an initial pass to find a minimum-cost
traversal, followed by a second pass that emits code based
on the instructions represented in each node. The specific
machine generator is JBurg, a Java-based BURG (Bottom-Up
Rewrite Generator) [5] that differs from other BURM
implementations in that it traverses the tree employing
dynamic programming pattern matching to satisfy goals. Two
examples of machine code emitted by the BURG are shown in
Figure 7.
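The two-pass structure can be illustrated with the hand-written Java sketch below. It is not JBurg and not generated code: the three rewrite rules, their unit costs, and the pseudo-assembly it prints are assumptions chosen to keep the example small. The first pass labels each node bottom-up with the cheapest rule that reduces it to a register; the second pass emits code for the chosen rules.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyBurs {
    /** Expression tree node: either a constant or an addition. */
    static final class Node {
        final String op;
        final int value;
        final Node left, right;
        int cost;   // minimum cost to reduce this subtree to a register
        int rule;   // rule achieving that cost; set by label()

        Node(int value) {
            this.op = "CONST"; this.value = value;
            this.left = null; this.right = null;
        }

        Node(Node left, Node right) {
            this.op = "ADD"; this.value = 0;
            this.left = left; this.right = right;
        }
    }

    /** Pass 1: bottom-up labeling with the cheapest matching rule. */
    static void label(Node n) {
        if (n.op.equals("CONST")) {       // rule 0: reg <- CONST
            n.cost = 1; n.rule = 0;
            return;
        }
        label(n.left);
        label(n.right);
        n.cost = n.left.cost + n.right.cost + 1;  // rule 1: reg <- ADD(reg, reg)
        n.rule = 1;
        if (n.right.op.equals("CONST")) {         // rule 2: reg <- ADD(reg, CONST)
            int c = n.left.cost + 1;
            if (c < n.cost) { n.cost = c; n.rule = 2; }
        }
    }

    /** Pass 2: emit code for the chosen rules; returns the result register. */
    static String emit(Node n, List<String> out, int[] nextReg) {
        switch (n.rule) {
            case 0: {
                String r = "r" + nextReg[0]++;
                out.add("mov " + r + ", " + n.value);
                return r;
            }
            case 2: {
                String r = emit(n.left, out, nextReg);
                out.add("add " + r + ", " + n.value);  // immediate operand
                return r;
            }
            default: {
                String l = emit(n.left, out, nextReg);
                String r = emit(n.right, out, nextReg);
                out.add("add " + l + ", " + r);
                return l;
            }
        }
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node(4), new Node(1));  // 4 + 1
        label(tree);
        List<String> out = new ArrayList<>();
        emit(tree, out, new int[] {0});
        for (String line : out) {
            System.out.println(line);
        }
    }
}
```

For the tree 4 + 1 the labeler prefers the add-immediate rule (cost 2) over loading both constants into registers (cost 3), so only two instructions are emitted.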

4.2.  Communication  Generation 

To generate communication, we generate partitions off-line
for 1, 2, ... nodes. This is a form of off-line rather than
runtime specialization.

Each node in the object graph has a unique identifier that
contains a virtual processor number. Communication is
inserted only for dependent objects. That is, for each
dependence relation to a remote object two calls are
generated: a send call that packs the access type and
associated data, and a receive call that fetches the
response. For each dependence relation from a remote object,
two calls are generated: a receive call that processes the
access type and associated data, and a send call that sends
the results of the access back.
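The pairing of calls can be sketched as follows. The Java code below only enumerates which calls would be inserted on which virtual processor for a given set of usage edges and a partition assignment; the textual call format and all names are hypothetical, and no bytecode is rewritten.

```java
import java.util.ArrayList;
import java.util.List;

final class CommPlanSketch {
    /**
     * uses[i] = {a, b} means object a uses object b; part[x] is the
     * virtual processor of object x. Returns the calls to insert,
     * written as "P<processor>: <call>".
     */
    static List<String> plan(int[][] uses, int[] part) {
        List<String> calls = new ArrayList<>();
        for (int[] u : uses) {
            int a = u[0];
            int b = u[1];
            if (part[a] == part[b]) {
                continue;  // local dependence: no communication needed
            }
            String request = "o" + a + "->o" + b;
            String response = "o" + b + "->o" + a;
            // Side of the using object: send the access, fetch the response.
            calls.add("P" + part[a] + ": send(request " + request + ")");
            calls.add("P" + part[a] + ": receive(response " + response + ")");
            // Side of the used object: process the access, send the result.
            calls.add("P" + part[b] + ": receive(request " + request + ")");
            calls.add("P" + part[b] + ": send(response " + response + ")");
        }
        return calls;
    }

    public static void main(String[] args) {
        int[][] uses = {{0, 1}, {1, 2}};
        int[] part = {0, 0, 1};
        for (String call : plan(uses, part)) {
            System.out.println(call);
        }
    }
}
```

In the example, objects 0 and 1 share a processor, so only the dependence from object 1 to object 2 produces communication: four calls, two on each side.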


x86:

    mov eax, 4      ; 1
    cmp 4, 2        ; 2a
    jle BB4         ; 2b
    mov eax, 4      ; 3a
    add eax, 4      ; 3b
BB4:
    ret eax         ; 4

StrongARM:

    mov R1, #4      ; 1
    cmp #4, #2      ; 2a
    ble BB4         ; 2b
    add R1, 4, 4    ; 3
.BB4
    mov PC, R14     ; 4

Figure 7. Machine code for two separate architectures.


Original byte-code:

13: aload                               //load Account object
14: invokevirtual Account.getSavings:()

Transformed byte-code:

13: aload                               //load DependentObject object
14: ldc INVOKE_METHOD_HASRETURN (int)   //access type
16: ldc "getSavings"                    //load method name
18: aconst_null                         //no method argument for getSavings()
19: invokevirtual DependentObject.access
22: checkcast Integer                   //cast to return type
25: invokevirtual Integer.intValue      //get primitive value

Figure 8. The transformation for method invocation
account.getSavings();.


The dependences handled by our current implementation are
object accesses, including field accesses, and method
invocations. For each dependent object that is referenced
remotely, there is a corresponding DependentObject that
performs Message Passing Interface (MPI) communication with
the home node of the referring object. Distributed
dependences are therefore transformed to accesses to
DependentObject instances.

Figure 8 illustrates the original and transformed bytecode
snippets for the method invocation account.getSavings(). The
transformation for method invocations performs three tasks:
prepare the arguments for the DependentObject access,
prepare the arguments (in a LinkedList) for the original
method call, and cast the return value (Object type) to the
appropriate class type or primitive value. The
transformation for field accesses is similar.
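In source-level terms, the transformed bytecode of Figure 8 behaves roughly like the Java sketch below. The sketch is an assumption in several respects: the class here is a local stand-in that forwards the access through reflection instead of MPI, the constant value and the access signature are guessed from the figure, and SavingsAccount is a hypothetical stand-in for the example's Account class.

```java
import java.lang.reflect.Method;
import java.util.List;

/** Stand-in for the class whose instance lives on another node. */
final class SavingsAccount {
    private final int savings;

    SavingsAccount(int savings) { this.savings = savings; }

    public int getSavings() { return savings; }
}

/** Local imitation of a dependent-object proxy (no MPI involved). */
final class LocalDependentObject {
    static final int INVOKE_METHOD_HASRETURN = 1;  // hypothetical value

    private final Object home;  // would reside on the home node

    LocalDependentObject(Object home) { this.home = home; }

    /**
     * Performs the access described by (accessType, name, args) and
     * returns the result as an Object, as in the transformed bytecode.
     * Methods are matched by name and argument count only.
     */
    Object access(int accessType, String name, List<Object> args) {
        Object[] actual = (args == null) ? new Object[0] : args.toArray();
        try {
            for (Method m : home.getClass().getMethods()) {
                if (m.getName().equals(name)
                        && m.getParameterCount() == actual.length) {
                    Object result = m.invoke(home, actual);
                    return accessType == INVOKE_METHOD_HASRETURN ? result : null;
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        throw new IllegalArgumentException("no such method: " + name);
    }

    public static void main(String[] args) {
        LocalDependentObject account =
                new LocalDependentObject(new SavingsAccount(300000));
        // Source-level equivalent of the transformed call site:
        int savings = ((Integer) account.access(
                INVOKE_METHOD_HASRETURN, "getSavings", null)).intValue();
        System.out.println(savings);
    }
}
```

The cast to Integer followed by intValue() mirrors the checkcast and invokevirtual instructions at offsets 22 and 25 of the transformed bytecode.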

The remote instantiation of a dependent class is translated
to an instantiation of a DependentObject, which in turn
communicates via MPI messages with the home node of the
dependent object. The home node then creates the object.
Figure 9 demonstrates the transformation of
new Account(i, n, s, c). The information passed in the MPI
message for distributed instantiation comprises the class
name and the arguments to the class constructor.


Original byte-code:

35: new Account
38: dup
39: iload_2     //i
40: aload_3     //n
41: iload 4     //s
43: iload 5     //c
45: invokespecial Account

Transformed bytecode:

35: new DependentObject
38: dup
39: iload_2     //i
40: aload_3     //n
41: iload 4     //s
43: iload 5     //c
//....
// prepare the constructor arguments
//....
105: ldc 0 (int)            //location of Account, Node0
107: ldc "Account" (String)
109: aload 6                //constructor arguments in a list
111: invokespecial DependentObject."<init>"

Figure 9. The transformation for new Account(i, n, s, c);


Figure 10. The organization of runtime services for
distributed execution. (Diagram labels: Code Partition,
User.)

The quality of communication generation is directly
influenced by the quality of dependence analysis. Our
analysis is type-based and thus not very precise. More
precise dependence information can be obtained from
points-to information, as used in the context of speculative
multithreading [19]. In addition, there are several
optimization techniques that can be applied to communication
generation: message aggregation, hoisting communication out
of loops, asynchronous communication, overlapping
communication and computation, data replication, and early
prefetch. Many of these techniques cannot be used with a
request/response communication style like RPC or RMI. In
contrast, we use message exchange communication to reveal
more optimization opportunities.

5.  Distributed  Execution 

The distributed target code partitions are executed within
the MPI-enhanced runtime environment. Currently we use
JVM-hosted execution rather than native execution. Even
though the retargetable code generation component is fully
implemented, it was easier to use a normal JVM since our
current experiments are conducted on resource-rich x86
platforms. Also, the use of a JVM does not affect our
current distributed execution evaluation (speed-up
measurements).

In our current implementation, on each node there are three
supporting services: the MPI service, the ExecutionStarter
service, and the Message Exchange service. Figure 10 depicts
this organization of the runtime services for distributed
execution. The MPI service sets up the necessary MPI working
environment, such as groups, communicators, and the
communication context.

The ExecutionStarter service starts the application by
invoking the main() method of the application class. Only
one copy of ExecutionStarter needs to be active, on the
processor node of the distributed execution environment
where the user initiates the application.

The core of this MPI-aware runtime support is the Message
Exchange service. This service processes all the send and
receive MPI communication generated from the object
dependence information. The Message Exchange service uses
two supporting data structures. One is the DependentObject
and the other is the exchanged Message. The runtime uses the
DependentObject (implemented by a Java class) to indicate an
object that has dependence relations to another partition.

Each dependent object contains the following information:
its class type, the identifier of the partition (node) that
hosts the object, and its unique identifier in that
partition (node). A message (packed in a Message structure)
exchanged between two dependent objects across two nodes
contains the object identifier of the receiver of the
communication and the relevant dependence data. The Message
Exchange service passes objects between nodes using a
streamed format.

We currently identify two types of messages: NEW and
DEPENDENCE, for object instantiation and data dependence
respectively. We are in the process of defining more precise
dependence relations (e.g., read after write), and
discriminating further between messages.
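A message of this kind might look like the Java sketch below. The class name WireMessage, its fields, and the choice of standard Java serialization as the streamed format are all assumptions made for illustration; the sketch only shows a message being flattened to bytes and rebuilt, which is the form in which it could be handed to an MPI send and receive pair.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical message exchanged between two dependent objects. */
final class WireMessage implements Serializable {
    private static final long serialVersionUID = 1L;

    /** The two message types currently distinguished. */
    enum Kind { NEW, DEPENDENCE }

    final Kind kind;
    final int receiverId;      // object identifier of the receiver
    final Serializable data;   // dependence data (field, arguments, result)

    WireMessage(Kind kind, int receiverId, Serializable data) {
        this.kind = kind;
        this.receiverId = receiverId;
        this.data = data;
    }

    /** Flattens the message into a byte stream. */
    byte[] toBytes() {
        try (ByteArrayOutputStream buf = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(this);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds a message from the bytes produced by toBytes(). */
    static WireMessage fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (WireMessage) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        WireMessage sent = new WireMessage(Kind.DEPENDENCE, 7, "getSavings");
        WireMessage received = fromBytes(sent.toBytes());
        System.out.println(received.kind + " for object " + received.receiverId);
    }
}
```

Keeping the payload as a generic Serializable lets the same structure carry constructor arguments for NEW messages and access data for DEPENDENCE messages.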

6.  Profiler 

We  have  built  a  profiler  that  collects  statistics  indicating 
the  resource  consumption  of  a  program  during  runtime. 


The profiler is built on top of the Joeq compiler and
virtual machine. The profiler works either through
instrumentation or sampling. Some of the metrics can be
implemented using either technique. In these cases, the
instrumentation is useful as a baseline for assessing the
accuracy of the sampling. There are four basic categories of
runtime application behavior we are interested in: CPU,
memory, battery, and communication (i.e., network) usage. To
measure these four basic categories, we have currently
implemented six metrics: method duration, method frequency,
hot methods, hot paths, memory allocation, and dynamic call
graph.

The method duration metric measures the amount of time each
method takes to execute. The metric was originally
implemented by overloading the method invocation process of
the built-in native² interpreter. The times of entry and
exit of each method (both system-level and user-level) are
recorded in a profiling class. Unfortunately, due to
problems within Joeq itself, this metric had to be measured
on our test benchmarks with Java source-level
instrumentation. See Section 7.3 for details.

The method frequency metric measures how often each method
is invoked. This metric can also be used as a less expensive
substitute for the method duration metric. A counter
associated with each method keeps track of the number of
invocations. However, as with the method duration metric,
source-level instrumentation had to be performed instead.

The hot methods metric reduces the overhead of the previous
metric by using sampling. For each native thread that Joeq
spawns, it also attaches a separate native interrupter
thread. The interrupter thread's main task is to signal the
thread queue when to switch threads. This provides a
convenient approach to sampling: simply pass control from
the interrupter thread to the profiler at each scheduling
time quantum. The profiler then obtains the currently
executing method by reading the call stack of the thread and
recording the top stack frame.

The hot paths metric goes a level above the hot methods
metric in its scope and measures the hottest execution paths
through the application. We extend the hot methods technique
and sample the entire call stack instead of only the top
stack frame.
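The difference between the two sampled metrics comes down to what is counted per sample, as the Java sketch below shows. It is not the Joeq-based profiler: it relies on the standard Thread.getStackTrace() call rather than an interrupter thread, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SamplingProfile {
    /** Sample counts keyed by the method on top of the stack. */
    final Map<String, Integer> hotMethods = new HashMap<>();
    /** Sample counts keyed by the whole call stack, top frame first. */
    final Map<String, Integer> hotPaths = new HashMap<>();

    /** Records one sample given as a list of frames, top frame first. */
    void record(List<String> stackTopFirst) {
        if (stackTopFirst.isEmpty()) {
            return;  // thread had no frames at sampling time
        }
        // Hot methods: only the currently executing method counts.
        hotMethods.merge(stackTopFirst.get(0), 1, Integer::sum);
        // Hot paths: the entire call stack identifies the path.
        hotPaths.merge(String.join(" <- ", stackTopFirst), 1, Integer::sum);
    }

    /** Takes one sample of a running thread. */
    void sample(Thread thread) {
        List<String> frames = new ArrayList<>();
        for (StackTraceElement e : thread.getStackTrace()) {
            frames.add(e.getClassName() + "." + e.getMethodName());
        }
        record(frames);
    }

    public static void main(String[] args) {
        SamplingProfile profile = new SamplingProfile();
        profile.sample(Thread.currentThread());
        System.out.println(profile.hotMethods.keySet());
    }
}
```

A real sampler would call sample() at a fixed interval from a separate thread; keeping record() separate makes the counting logic testable with synthetic stacks.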

The memory allocation metric is implemented by directly
modifying the internal Java virtual machine system code of
Joeq. By overloading some of the methods that implement
memory allocation, we can estimate the memory profile of the
application without performing instrumentation.
Unfortunately, this metric is currently only a very rough
approximation, but we are confident that much better
accuracy will be achieved in the near future.


² "Native" in the context of Joeq means it bootstraps itself
into a fully functional JVM without the need for a host JVM
to support it.


benchmark 

size 

CRG 

ODG  | 

#C 

#M 

KB 

#N 

#E 

EC 

#N 

#E 

EC 

create* 

14 

28 

13 

17 

6 

2 

210 

632 

82 

method* 

6 

35 

10 

12 

10 

2 

9 

32 

2 

crypt* 

6 

45 

12 

13 

13 

3 

11 

33 

1 

heapsort* 

6 

42 

10 

13 

13 

3 

11 

33 

2 

moldyn* 

8 

48 

17 

12 

15 

2 

9 

32 

2 

search* 

9 

57 

17 

14 

23 

3 

6 

20 

3 

cmprss** 

38 

295 

160 

36 

42 

1 

32 

107 

2 

db** 

32 

299 

155 

32 

26 

2 

49 

164 

8 

*  Java  Grande  benchmarks:  JGFCreateBench  and  JGFMethodBench  (section  1), 
JGFCryptBench  and  JGFHeapSortBench  (section  2),  JGFMolDynBench  and 
JGFSearchBench  (section  3). 

** SPEC JVM98 benchmarks: _201_compress and _209_db.

Table  1.  The  size  of  the  benchmarks  (number 
of  classes,  methods,  and  KB)  and  the  sizes  of 
the  resulting  graphs  (the  number  of  nodes, 
edges,  and  the  edgecut  for  both  CRG  and 
ODG). 


The dynamic call graph metric shows the methods that
were actually called in a specific application instance. It
is measured using sampling. It uses the same kind of data
as the hot paths metric, but processes the data differently
in order to construct the dynamic call graph.
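
One plausible way to process stack samples into a call graph is to treat every pair of adjacent frames as a caller-to-callee edge. The sketch below assumes that representation; the class name `DynamicCallGraph` and the use of plain method-name strings are hypothetical and are not taken from the Joeq profiler.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: builds caller -> callee edges from sampled call stacks. */
final class DynamicCallGraph {
    private final Map<String, Set<String>> callees = new TreeMap<>();

    /** frames: one sampled call stack, currently executing method first. */
    void addSample(List<String> frames) {
        // Adjacent frames (i, i - 1) form a caller -> callee pair.
        for (int i = frames.size() - 1; i > 0; i--) {
            String caller = frames.get(i);
            String callee = frames.get(i - 1);
            callees.computeIfAbsent(caller, k -> new TreeSet<>()).add(callee);
        }
    }

    Set<String> calleesOf(String caller) {
        return callees.getOrDefault(caller, Collections.emptySet());
    }

    /** Number of distinct caller -> callee edges observed so far. */
    int edgeCount() {
        int n = 0;
        for (Set<String> s : callees.values()) {
            n += s.size();
        }
        return n;
    }

    public static void main(String[] args) {
        DynamicCallGraph graph = new DynamicCallGraph();
        graph.addSample(Arrays.asList("g", "f", "main"));
        graph.addSample(Arrays.asList("h", "f", "main"));
        graph.addSample(Arrays.asList("f", "main"));
        System.out.println("f calls " + graph.calleesOf("f"));
        System.out.println("edges: " + graph.edgeCount());
    }
}
```

Because the graph is built from samples, a method that runs only briefly may never appear on a sampled stack, so a graph built this way is a lower bound on the calls that occurred.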

7.  Evaluation 

We have implemented a functional infrastructure prototype
that realizes the components presented in the above
sections. We evaluate the functionality and the performance
of our prototype with a set of benchmarks from the Java
Grande benchmark suite and SPEC JVM98 (see Table 1). In
our experiments the networked configuration includes a
service node, a 1.7GHz Pentium III machine (512MB RAM,
SuSE 9.1), and a computation node, an 800MHz Pentium III
machine (384MB RAM, Red Hat 9.0). Both nodes run JDK 1.4
and are connected via 100 Mbps Ethernet. At the time of
publication we did not have access to other networked
configurations, so we experimented only with the machines
available to us. In the future, we plan to set up a network
consisting of multiple nodes with significant differences
in resources and configurations.

7.1.  Dependence  Graph  Construction 

Table 1 shows the sizes of the original benchmarks as
well as the resulting class relation graph (CRG) and object
dependence graph (ODG) for each benchmark. The edgecut
is the number of edges that straddle partitions. Currently
we use the class relation graph partitioning to distribute
the program.
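
The edgecut definition can be made concrete with a small sketch. The graph encoding (an edge list over integer node ids and an array mapping each node to its partition) and the class name `EdgeCut` are assumptions chosen for illustration.

```java
/** Hypothetical sketch: counts the edges whose endpoints lie in different partitions. */
final class EdgeCut {

    /**
     * @param edges       each entry is a pair {u, v} of node ids
     * @param partitionOf partitionOf[n] is the partition assigned to node n
     */
    static int edgeCut(int[][] edges, int[] partitionOf) {
        int cut = 0;
        for (int[] e : edges) {
            if (partitionOf[e[0]] != partitionOf[e[1]]) {
                cut++; // this edge straddles two partitions
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // A 4-node cycle 0-1-2-3-0 split into partitions {0, 1} and {2, 3}.
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {0, 3}};
        int[] partitionOf = {0, 0, 1, 1};
        System.out.println("edgecut = " + edgeCut(edges, partitionOf));
    }
}
```

In the example, edges 1-2 and 0-3 cross the partition boundary, so the edgecut is 2. Each cut edge corresponds to a dependence that may require communication after distribution.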


               construct       partition
benchmark     CRG     ODG     CRG     ODG   rewrite
create       2043    3056       7      12       271
method       1704      53       7       6       202
crypt        1715      40       7       7       209
heapsort     1615      54       6       7       193
moldyn       1903     114       6       6       215
search       1868      49       7       7       204
compress     2305     100       6       7       285
db           2434      99      10       7       280

Table 2. The execution time breakdown in
code distribution. The columns indicate the
construction time, the partitioning time, and
the bytecode rewriting time, in milliseconds.


The execution times for graph construction and
distribution transformation are shown in Table 2, in
milliseconds. The static analysis of the class relations is
on the order of seconds (1.6 to 2.4 s), because extracting
high-level dependence information from the low-level
bytecode format is computationally intensive and time
consuming. However, since this process happens only once,
at compile time, it is not as crucial as the phases of the
dynamic repartitioning process: ODG construction,
partitioning, and code rewriting. Of these latter phases,
only partitioning has to be completely re-executed in each
adaptive iteration; ODG construction and code rewriting can
both be adjusted incrementally. Since the partition time is
only about 10ms, we believe that the results are promising
for our future plans of incorporating adaptive
repartitioning. The Create benchmark has an unusually long
ODG construction time because it creates a large number of
objects, which substantially complicates the object graph.


7.2.  Distributed  Execution 

To evaluate the performance of the distributed execution
runtime, we compare the distributed execution time of the
transformed benchmarks with the execution time of the
original sequential benchmarks on the 800MHz Pentium III
machine. The execution speedup is depicted in Figure 11.
The distributed execution shows comparable or improved
performance (79.2% to 175.2%) relative to the original
sequential execution. The results are promising, since
without any further optimization the distributed execution
results in either very little overhead (in Method and
Compress) or a speed-up. Since we currently use a
suboptimal naive partitioning, we expect further
performance gains once optimization is introduced into the
distribution infrastructure in our future work.


[Figure 11 plot omitted: execution speedup per benchmark for Create, Method, Crypt, Heapsort, Moldyn, Search, Compress, and Db.]


Figure  11.  Performance  comparison  of  cen¬ 
tralized  and  distributed  executions. 


7.3.  Profiling 

We evaluate the profiler on a subset of the Java Grande
Forum benchmarks. For the baseline measurements, Joeq
runs each of the benchmarks with all the profiling code
compiled in, but not enabled. Then each of the profilers
is enabled in turn. The tests were conducted on an AMD
Athlon XP 2000+ (1.67 GHz) with 512 MB RAM running
Windows XP. In each of the tests, Joeq was allocated a
maximum heap size of 1024 MB.

Table 3 shows the total execution times for each of the
benchmarks and profilers. The average overhead for all the
profilers is 21.94%. A general trend is that metrics
measured with instrumentation incurred notably higher
overhead than the others, which used either sampling or
modification of the JVM system code. The hot paths, dynamic
call graph, and memory usage metrics all incurred similar
levels of overhead, approximately 14-20%. The lowest
overhead came from the hot methods metric, at approximately
4%.
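
The overhead figures follow from the totals in Table 3 as (profiled - baseline) / baseline. The sketch below recomputes them from the reported totals; the class and method names are illustrative only.

```java
import java.util.Locale;

/** Sketch: recomputes profiler overhead from the total times reported in Table 3. */
final class ProfilerOverhead {

    /** Relative overhead of a profiled run over the baseline, in percent. */
    static double overheadPercent(double baselineSeconds, double profiledSeconds) {
        return (profiledSeconds - baselineSeconds) / baselineSeconds * 100.0;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double baseline = 369.734; // "Total" row, Baseline column
        String[] names = {"Hot Paths", "Dynamic Call Graph", "Hot Methods",
                          "Method Duration", "Method Frequency", "Memory Usage"};
        double[] totals = {421.682, 439.261, 384.449, 552.144, 466.109, 441.433};

        double[] overheads = new double[totals.length];
        for (int i = 0; i < totals.length; i++) {
            overheads[i] = overheadPercent(baseline, totals[i]);
            System.out.println(String.format(Locale.ROOT, "%-20s %6.2f%%", names[i], overheads[i]));
        }
        System.out.println(String.format(Locale.ROOT, "%-20s %6.2f%%", "Average", mean(overheads)));
    }
}
```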

8.  Related  Work 

There  are  two  types  of  automatic  distribution  compilers 
or  virtual  machines  available:  automatic  distribution  to  ex¬ 
ploit  data  parallelism  in  scientific  programs  and  automatic 
partitioning  of  Java  programs  to  relieve  resources  on  con¬ 
strained  devices. 

Automatic  Distribution  of  Data  Parallel  Programs. 

Automatic  parallelization  is  one  research  area  that  has  in¬ 
vestigated  the  partitioning  problem  mainly  for  scientific 
programs  typically  targeting  a  significant  reduction  in  CPU 
or  memory  consumption  [8,  13,  1,  6,  11].  There  are  two 
main  differences  between  partitioning  for  scientific  appli¬ 
cations  and  our  work.  First,  most  of  the  previous  work  fo¬ 
cuses  on  array  partitioning,  or  loop  iteration  partitioning  for 


                                           Hot    Dynamic        Hot     Method     Method     Memory
Test/Metric                Baseline      Paths Call Graph    Methods   Duration  Frequency      Usage
CreateBench (int[])           4.406      5.125      5.375      5.468      4.734      5.937      9.718
CreateBench (long[])         18.250     28.046     28.640     19.281     25.140     31.062     35.000
CreateBench (float[])         4.468      6.437      5.906      4.265      5.015      4.659      6.015
CreateBench (Object[])        2.156      2.421      2.468      2.328      2.296      2.203      2.281
CreateBench (Custom[])       10.718     12.687     12.500     11.484     11.875     11.234     51.406
MethodBench                 196.187    212.140    222.359    202.281    323.437    248.156    198.937
FFTA                         32.187     37.609     40.765     33.812     35.781     36.546     34.312
HeapSortA                     3.906      4.296      4.968      4.281     17.297     14.328      3.968
MolDynA                      48.234     53.062     57.390     50.234     51.375     51.750     50.125
MonteCarloA                  48.734     59.859     58.890     51.015     75.194     60.234     49.671
Total:                      369.734    421.682    439.261    384.449    552.144    466.109    441.433
Overhead:                     0.00%     14.05%     18.80%      3.98%     49.34%     26.07%     19.39%

Table  3.  The  profiler  evaluation.  Each  row  is  the  individual  benchmark,  while  each  column  is  the  name 
of  the  profiler  enabled.  The  last  row  is  the  total  time  it  took  to  execute  all  the  benchmarks.  The  times 
are  given  in  seconds.  The  baseline  column  is  the  execution  times  with  all  the  profiling  code  com¬ 
piled  in  but  not  enabled. 


scientific programs. We address general program
distribution, where all the objects in a program are of
interest. Second, the main objective for partitioning in
scientific programs is to speed up execution, either on
distributed or on shared memory machines. Our design
choices are motivated by the ability to model multiple
resources and study their interaction. The general
distribution can then be specialized at runtime depending
on resource priorities and the actual environment.

Automatic Distribution of Java Programs. JavaParty [17]
extends Java with remote objects. The objective is to
provide location transparency in a distributed memory
environment. In contrast, we achieve the transparency
effect without extending Java syntax. However, we do not
give the user any control over distribution.

Messer et al.'s approach, though entirely dynamic, has
an objective that more closely matches our own [15]. The
goal is to transparently off-load services to relieve
memory and processing constraints on resource-constrained
devices. The main difference is the handling of object
references. In their approach, each JVM maps the references
of all other JVMs, which results in a replicate-all
strategy. Our approach is partly static, and it considers
only some of the interactions between objects (those that
cross processors).

Another approach, similar to the distributed shared
memory paradigm, is to implement a distributed JVM as a
global object space [4]. We achieve the same transparency
effect at a hopefully lower cost, since we distinguish
between local and remote accesses.

J-Orchestra [21] transforms Java bytecode into distributed
Java applications. This is also an abstract shared memory
implementation. The communication is synchronous only
(i.e., RMI). To exploit asynchronous communication, we use
automatically generated point-to-point messages.

Pangaea [20] is a system that can distribute Java programs
using arbitrary middleware (Java RMI, CORBA) to invoke
objects remotely. The system is based on the original
algorithm by Spiegel, which was the basis for our own
extended algorithm [2]. Pangaea's input is a centralized
Java source-code program. The result is a distributed
program based on the synchronous remote method invocation
communication paradigm. Our approach starts from Java
bytecode and targets a flexible distribution model (i.e.,
one that allows the exploitation of concurrency and
asynchronous communication) in a program.

Coign  [9]  is  also  a  system  that  strives  to  automatically 
partition  binary  programs  (built  from  COM  components) 
for  optimal  execution.  Coign  is  designed  to  handle  2-way 
partitioning  only  (between  two  nodes)  for  client-server  dis¬ 
tributions.  Also,  the  distribution  is  fully  dynamic,  based  on 
profiling  history.  We  combine  static  analysis  with  off-line 
distributions  in  a  general,  multi-way  partitioning. 

9.  Conclusion 

This paper presented the design and implementation of a
research compiler and runtime infrastructure for automatic
program distribution. While not all programs can benefit
from automatic distribution, we believe that it is
important to be able to model the resources of a program
and study the effect of distribution on program behavior
with respect to resource consumption. The motivating
factors for our design were flexibility and modularity.
Thus, we expect each of the techniques we presented to
evolve as more experiments are conducted.

Our  design  is  based  on  two  key  ideas:  find  the  depen¬ 
dences  between  the  objects  in  a  program  and  use  this  infor¬ 
mation  to  automatically  generate  communication.  We  have 
shown  how  we  cast  the  resource  modeling  and  program  dis¬ 
tribution  problem  into  an  optimal  graph  partitioning  prob¬ 
lem.  We  model  the  resources  as  weights  on  the  dependence 
graph  and  then  experiment  with  multiple  resource  priori¬ 
ties  and  constraints.  We  have  presented  the  code  generation 
phase  as  two  separate  parts:  platform  independent  code  gen¬ 
eration  and  communication  generation. 

We have also described a profiler system that allows us to
collect information about the program behavior and,
eventually, to redistribute the program according to the
actual access patterns and resource requirements. Our
present infrastructure only handles static partitioning.
While dynamic repartitioning is the goal of our next design
iteration, it does not influence the design of the
infrastructure presented in this paper.

Finally,  we  have  presented  results  on  each  of  the  tech¬ 
niques  that  we  have  introduced.  The  results  indicate  that 
partitioning  takes  little  time  and  the  computed  dependence 
graphs  are  within  manageable  sizes.  We  have  also  shown 
that  without  any  further  tuning,  the  distributed  execution  re¬ 
sults  in  either  a  very  small  overhead  or  a  speed-up.  Finally, 
we  have  evaluated  our  profiler  system  in  terms  of  the  in¬ 
curred  overhead  as  well  as  collected  data. 